Upgrade fluent-bit (Linux 5.0.4, Windows 5.0.3) #1671

Merged: zanejohnson-azure merged 1 commit into ci_prod from zanejohnson-azure/upgrade-fluent-bit on May 14, 2026
@zanejohnson-azure (Contributor) commented May 5, 2026

Summary

Upgrades fluent-bit to the latest available versions on each platform.

| Platform | File | Before | After |
|---|---|---|---|
| Linux | kubernetes/linux/setup.sh | azcu-fluent-bit-4.0.14 | azcu-fluent-bit-5.0.4 |
| Windows | kubernetes/windows/setup.ps1 | fluent-bit-4.0.3-win64.zip | fluent-bit-5.0.3-win64.zip |

Notes

  • Linux: The azcu-fluent-bit-5.0.4 spec is already published in Azure/dalec-build-defs#7471, so no companion PR is needed there.
  • Windows: 5.0.3 is the latest version currently published at https://packages.fluentbit.io/windows/. Upstream 5.0.4 has been released but the Windows zip has not been pushed yet.

fluent-bit 4.0.14 → 5.0.x Upgrade — Validation Summary

PR: #1671 (zanejohnson-azure/upgrade-fluent-bit)
Changes:

  • Linux: kubernetes/linux/setup.sh → azcu-fluent-bit-5.0.4
  • Windows: kubernetes/windows/setup.ps1 → fluent-bit-5.0.3-win64.zip (the 5.0.4 Windows zip is not yet published upstream; will bump when available)

Verdict: ✅ No regressions. Safe to merge once Windows 5.0.4 lands.


How it was validated

CPU and memory usage, before and after

In each chart, the first half is from the new fluent-bit version and the second half is from the current fluent-bit version.

Linux

Both CPU and memory usage dropped in the new version.
(image: Linux CPU and memory usage)

Windows

CPU dropped slightly and memory usage increased slightly in the new fluent-bit version. Memory jumped transiently for a couple of minutes on the new version, then quickly dropped back to a stable level.
(image: Windows CPU and memory usage)

A/B backdoor deployment on AKS cluster zane-ama-logs-helm-test (LA workspace 222e72f7-1ad8-4e28-b2a9-07d046eedef4):

  • Baseline: ciprod:3.3.0 (current prod, fluent-bit 4.0.14)
  • Test: cidev:3.3.0-6-g1d77401ab-20260506045747 (this PR, fluent-bit 5.0.4 Linux)

Data-volume parity (1-min bins, 5-min window)

| Table | Prod | Test | Delta | Verdict |
|---|---|---|---|---|
| ContainerInventory | 822 | 825 | +3 | ✅ PASS: the 3 extra rows are unrelated azsecpack-azl3-image (Azure Security Linux DaemonSet) snapshots |
| KubeNodeInventory | 25 | 25 | 0 | ✅ PASS |
| KubePodInventory | 825 | 826 | +1 | ✅ PASS: within noise |
| InsightsMetrics | 825 | 825 | 0 | ✅ PASS |
| Perf | 5827 | 5835 | +8 | ✅ PASS: matches azsecpack containers above |
| ContainerLogV2 | varies | varies | | ✅ PASS: no sustained drop/spike |

Per-container investigation confirmed the small deltas are unrelated cluster churn (azsecpack), not the ama-logs code change.
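The parity check in the table above can be sketched as a small script. This is only an illustration: the row counts are copied from the table, while the `TOLERANCE` threshold and the `parity_report` helper are hypothetical, since the PR does not state a numeric definition of "within noise".

```python
# Row counts copied from the parity table above: (prod, test).
COUNTS = {
    "ContainerInventory": (822, 825),
    "KubeNodeInventory": (25, 25),
    "KubePodInventory": (825, 826),
    "InsightsMetrics": (825, 825),
    "Perf": (5827, 5835),
}

# Hypothetical threshold (assumption): the PR gives no numeric tolerance.
TOLERANCE = 0.01  # 1% relative difference


def parity_report(counts, tolerance=TOLERANCE):
    """Return {table: (delta, within_tolerance)} for each table."""
    report = {}
    for table, (prod, test) in counts.items():
        delta = test - prod
        if prod == 0:
            # Avoid division by zero: only an exact match passes.
            within = delta == 0
        else:
            within = abs(delta) / prod <= tolerance
        report[table] = (delta, within)
    return report


if __name__ == "__main__":
    for table, (delta, ok) in parity_report(COUNTS).items():
        print(f"{table}: delta={delta:+d} {'PASS' if ok else 'CHECK'}")
```

A relative threshold is used here so that large tables such as Perf are not flagged for a handful of extra rows; any delta that exceeds it would still need the per-container investigation described above.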


Functional smoke tests

| # | Area | Status | Notes |
|---|---|---|---|
| 1 | Pod startup / no crashloop | ✅ PASS | 0 restarts across all ama-logs pods over 41+ h soak |
| 2 | ContainerLogV2 ingestion | ✅ PASS | Stream uninterrupted across rollout |
| 3 | KubePodInventory / Inventory ingestion | ✅ PASS | Counts match prod within noise |
| 4 | InsightsMetrics / Perf ingestion | ✅ PASS | Counts match prod within noise |
| 5 | KubeEvents | ✅ PASS | No change in event flow |
| 6 | DaemonSet rollout | ✅ PASS | kubectl rollout restart ds/ama-logs clean, 4/4 pods Ready |
| 7 | Prometheus scraping (ama-logs-prometheus container) | ✅ PASS | No regression in metrics scrape |
| 8 | Multiline (Java stack trace) | PASS | See section below for the explicit before/after test |

Multiline stack-trace grouping (the highest-risk 5.x area)

fluent-bit 5.x rewrote in_emitter and the multiline buffering path. ama-logs gates multiline behind a customer opt-in: the #${MultilineEnabled} token in fluent-bit.conf is stripped by fluent-bit-conf-customizer.rb when enable_multiline_logs.enabled = "true". This path was tested explicitly:

Method: Deployed a busybox pod (multiline-emitter) that prints a 7-line Java NPE + Caused by: chain every 30 s, tagged iter-N. Captured ContainerLogV2 rows before (multiline OFF) and after (configmap patch enabling multiline for java).

| Mode | Iters in window | ContainerLogV2 rows | Rows / trace |
|---|---|---|---|
| BEFORE (default, multiline OFF) | 11 | 77 | 7 (one row per \n-terminated line) |
| AFTER (configmap opt-in, java) | 9 | 9 | 1 (full trace collapsed) |

Sample AFTER row (iter-19 LogMessage):

java.lang.NullPointerException: multiline-marker iter-19
	at com.example.Foo.bar(Foo.java:42)
	at com.example.Foo.baz(Foo.java:13)
	at com.example.Main.main(Main.java:7)
Caused by: java.io.IOException: disk read failed iter-19
	at com.example.Disk.read(Disk.java:99)
	at com.example.Foo.bar(Foo.java:40)

Result: Multiline grouping works as documented on 5.0.4. No partial traces, no dropped continuation lines, no duplicate emissions.
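To make the rows-per-trace numbers concrete, here is a minimal sketch of the grouping behavior the test looks for. It is not fluent-bit's actual Java multiline parser: the continuation rule (indented `at ...` frames and `Caused by:` lines attach to the previous record) is a simplifying assumption, and `make_trace` and `group_traces` are hypothetical helpers that mimic the test pod's output.

```python
import re

# Simplified continuation rule (assumption, not fluent-bit's real parser):
# indented "at ..." frames and "Caused by:" lines belong to the previous record.
CONTINUATION = re.compile(r"^(\s+at\s|Caused by:)")


def make_trace(i):
    """Build the 7-line trace printed by the test pod for iteration i."""
    return [
        f"java.lang.NullPointerException: multiline-marker iter-{i}",
        "\tat com.example.Foo.bar(Foo.java:42)",
        "\tat com.example.Foo.baz(Foo.java:13)",
        "\tat com.example.Main.main(Main.java:7)",
        f"Caused by: java.io.IOException: disk read failed iter-{i}",
        "\tat com.example.Disk.read(Disk.java:99)",
        "\tat com.example.Foo.bar(Foo.java:40)",
    ]


def group_traces(lines):
    """Collapse continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        if records and CONTINUATION.match(line):
            records[-1] += "\n" + line
        else:
            records.append(line)
    return records


if __name__ == "__main__":
    lines = [line for i in range(9) for line in make_trace(i)]
    print(len(lines), "raw lines ->", len(group_traces(lines)), "records")
```

With multiline off, each of the 7 lines becomes its own row (11 iterations × 7 = 77 rows in the BEFORE window). With grouping on, each iteration yields one record, which matches the 9 rows for 9 iterations in the AFTER window.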


- Linux: azcu-fluent-bit 4.0.14 -> 5.0.4 (spec already published in dalec-build-defs)

- Windows: fluent-bit 4.0.3 -> 5.0.3 (latest available zip on packages.fluentbit.io)

Co-authored-by: Copilot <[email protected]>
zanejohnson-azure requested a review from a team as a code owner on May 5, 2026, 17:38

@suyadav1 (Contributor) commented

  1. Did we validate the telemetry flowing as expected?
  2. For multiline, we have tests in scenarios. Could you please check those too?

@suyadav1 (Contributor) commented

Please share the resource usage as well.

Comment thread kubernetes/windows/setup.ps1
@zanejohnson-azure (Contributor, Author) commented May 11, 2026

> 1. Did we validate the telemetry flowing as expected?
> 2. For multiline, we have tests in scenarios. Could you please check those too?

  1. Yes. Exceptions are empty and metrics are flowing:

    // Exceptions are empty
    exceptions
    | where timestamp > ago(2h)
    | where tostring(customDimensions.ID) =~ "/subscriptions/xx/resourcegroups/zane-rg/providers/Microsoft.ContainerService/managedClusters/zane-ama-logs-helm-test"
    | project timestamp, type, outerMessage, customDimensions

    // See that metrics are flowing
    customMetrics
    | where timestamp > ago(1h)
    | where tostring(customDimensions.ID) =~ "/subscriptions/xx/resourcegroups/zane-rg/providers/Microsoft.ContainerService/managedClusters/zane-ama-logs-helm-test"
    | summarize Count=count(), Last=max(timestamp) by name, tostring(customDimensions.Version), tostring(customDimensions.Controller)
    | order by Count desc

  2. Tested all 4 languages. No multiline behavior change in the new image.

@zanejohnson-azure (Contributor, Author) commented

> Please share the resource usage as well.

Updated in the PR.

zanejohnson-azure merged commit 231b8d8 into ci_prod on May 14, 2026
19 checks passed